Document Clustering using Word Sense Disambiguation

Authors

  • M. S. Mostafa
  • M. H. Haggag
  • W. H. Gomaa
Abstract

In computational linguistics, word sense disambiguation (WSD) is the problem of determining which sense of a word with several distinct senses is used in a given sentence. This paper addresses text document clustering, one of the major tasks of text processing. Document clustering is the process of discovering groups of information in text documents and clustering these documents into the most relevant groups. Large document corpora suffer from ambiguity problems such as synonymy, polysemy, and other semantic relations. For this reason, we perform WSD on all terms in all documents to obtain the best sense of each term, and these senses are used as document features in the clustering process. Our experimental results show that the efficiency of document clustering using WSD increases linearly with the size of the document dataset. Different part-of-speech (POS) taggers were tested to determine the best one, and the effect of different window sizes on the WSD task was also compared.
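
The pipeline the abstract describes (disambiguate each term using a context window, then use the chosen senses as document features) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified Lesk-style overlap between context words and sense glosses. The `SENSES` inventory, the sense labels, and the function names are hypothetical stand-ins for a real lexicon such as WordNet.

```python
from collections import Counter

# Hypothetical toy sense inventory: sense label -> gloss words.
# A real system would draw senses and glosses from a lexicon such as WordNet.
SENSES = {
    "bank": {
        "bank.finance": {"money", "loan", "deposit", "account", "cash"},
        "bank.river": {"river", "water", "shore", "fishing", "mud"},
    },
}


def disambiguate(tokens, index, window=3):
    """Return the sense whose gloss overlaps most with the context window."""
    word = tokens[index]
    senses = SENSES.get(word)
    if not senses:
        return word  # word not in the inventory: keep the surface form
    start = max(0, index - window)
    # Context = up to `window` tokens on each side of the target word.
    context = set(tokens[start:index] + tokens[index + 1:index + 1 + window])
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(senses), key=lambda s: len(senses[s] & context))


def sense_features(text, window=3):
    """Build a bag-of-senses feature vector for one document."""
    tokens = text.lower().split()
    return Counter(disambiguate(tokens, i, window) for i in range(len(tokens)))


doc_a = "she opened an account at the bank to deposit money"
doc_b = "they went fishing on the bank of the river"
print(sense_features(doc_a))
print(sense_features(doc_b))
print(sense_features(doc_b, window=1))  # a narrow window loses the useful context
```

With these feature vectors, the two documents no longer share a "bank" feature, so a clustering algorithm is less likely to group them on that term alone. The last call shows why window size matters: with one token of context on each side, the clue words fall outside the window and the choice falls back to tie-breaking.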

Similar resources

Graph-based Word Clustering using a Web Search Engine

Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts. Each pair of words is queried to a search engine, which produces a co-occurrence matrix. By cal...

Investigating the Role of Different Types of Homograph Contexts in Determining Similarity Between Documents

Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation, thereby improving the effectiveness of the retrieved results. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided into different types, wit...

An Efficient Technique to Improve Snippet Clustering

Document clustering is an effective tool to manage information overload. By grouping similar documents together, we enable a human observer to quickly browse large document collections and make it possible to easily grasp the distinct topics and subtopics. In this paper we survey the most important problems and techniques related to text information retrieval: document pre-processing and filtering...

Improving Summarization of Biomedical Documents Using Word Sense Disambiguation

We describe a concept-based summarization system for biomedical documents and show that its performance can be improved using Word Sense Disambiguation. The system represents the documents as graphs formed from concepts and relations from the UMLS. A degree-based clustering algorithm is applied to these graphs to discover different themes or topics within the document. To create the graphs, the...

Meaningful Clusters

We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.

Publication date: 2008